Assignment 4: Clustering and classification

This week I have worked on clustering and classification. This week definitely felt much more easier for me.

date()
## [1] "Tue Nov 29 05:26:20 2022"

4.1: Data wrangling

This week, data wrangling felt even easier than the last week. I mostly used some help from create_alc.R. The R code of the data wrangling part is in the data folder of my Github repository. I will put the link here as well: https://github.com/bbayraktaroglu/IODS-project/blob/master/data/create_human.R

4.2 Analysis

Setting up the packages

library(MASS)
library(dplyr)
library(tidyr)
library(tidyverse)
library(corrplot)
library(ggplot2)
library(plotly)

4.2.1: Reading the dataset

# reading the required file for the assignment
data("Boston")
# checking out its dimension, structure and summary
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

In the Boston dataset, there are 506 observations and 14 variables. It is included in the MASS package of R. This data frame contains gathered data related to housing values in suburbs of Boston. Most of the variables are numeric (float), while “chas” and “rad” variables are integers.

4.2.2 Plots

Let’s put our newly learned knowledge about correlation plots to good use. The following is the correlation matrix and its various plots of the Boston data:

# calculating the correlation matrix, also round it to 2 digits
cor_matrix <- cor(Boston) %>% round(digits=2)

# print the correlation matrix
print(cor_matrix)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
library(corrplot)
corrplot(cor_matrix, method="circle")

corrplot(cor_matrix, method="number")

Observe that most of the variables are more or less correlated with each other, but the “chas” variable is mostly correlated with itself, while having correlation very close to 0 with the other variables. We know from basic probability theory that uncorrelated data does not imply independence, so we cannot infer that “chas” is independent from the other variables. We can only say that it is almost uncorrelated from the other variables. “rad” and “indus” has high overall positive correlation with most of the other variables (except “chas”). “rad” has 0.91 correlation with “tax” and 0.72 with “indus”. “indus” has -0.71 correlation (strong negative correlation) with “dis”, while “nox” has -0.77 correlation with “dis”.

4.2.3 Standardize the dataset and print scaled data summary

We will scale the data by subtract the column means from the corresponding columns and divide the difference with standard deviation. This normalizes the variables to be centered with standard deviation 1.

# scaling the Boston
boston_scaled <- as.data.frame(scale(Boston))

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

We now create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). We will use the quantiles as the break points in the categorical variable.

We will then drop the old crime rate variable from the dataset. Afterwards, we divide the dataset to train and test sets, so that 80% of the data belongs to the train set.

# creating a categorical variable called "crime" from scaled crime rate
boston_scaled$crim <- as.numeric(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim), include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))

# remove original crim from the dataset
boston_scaled <- boston_scaled %>% dplyr::select(-crim)

# add the new categorical variable to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset
n <- nrow(boston_scaled) 

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# creating the train set
train <- boston_scaled[ind,]

# creating the test set 
test <- boston_scaled[-ind,]

4.2.4 Fit the LDA and draw its (bi)plot

We will now fit the linear discriminant analysis on the train set. We will use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. We then draw the LDA (bi)plot.

# linear discriminant analysis
lda.fit <- lda(crime ~ . , data = train)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)

Observe that, the LDA plot predicts “rad” has the most variation in the dataset, towards the mostly “high” cluster.

4.2.5 LDA prediction

set.seed(123)

# saving the correct classes from test data
correct_classes <-test$crime

# removing the crime variable from test data
test <- dplyr::select(test, -crime)

# predicting classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulating the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       14       8        2    0
##   med_low    5      18        8    0
##   med_high   0       5       17    1
##   high       0       0        0   24

We find that almost all of the results are accurately predicted. Correctly classified observations are about 67, while the rest (about 35) are incorrectly classified. The inaccuracy rate of the LDA is about 34% (can be as low as 23% in some other sampling with another other seed).

4.2.6 K-means clustering

We reload Boston, rescale it and compute its Euclidean distance.

# reload the data
data("Boston")

# scale the data again
boston_scaled <- as.data.frame(scale(Boston))

# compute the Euclidean distance of Boston
dist_eu <- dist(boston_scaled)

# summary of dist_eu 
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

We now run the k-means algorithm:

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.

The above plot extensively shows us that there is a significant drop at the value 2. Thus, the optimal number of clusters is 2.

We now run k-means algorithm again, this time with 2 clusters, and plot the Boston dataset with the clusters. The clusters will be colored in red and black.

set.seed(123)

# k-means clustering with 2 clusters
km <- kmeans(Boston, centers = 2)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

If one zooms in to the plot above, one would see that “rad” has nicely separated clusters across all of the possible pairings. “tax” also has good separation of clusters. The other variables are a complete mess, and no other conclusion can be drawn.

4.2.7 Bonus: K-means algorithm and LDA

We will now perform k-means algorithm on the original Boston data (after scaling). We choose 5 clusters. We then perform LDA using the clusters as target classes. We will include all the variables in the Boston data in the LDA model.

set.seed(5)
# reload the data
data("Boston")

# scale the data again
boston_scaled <- as.data.frame(scale(Boston))

# k-means clustering with 5 clusters
km <- kmeans(Boston, centers = 5)

# linear discriminant analysis on the clusters, with data=boston_scaled, and target variable km$cluster
lda.fit <- lda(km$cluster ~ ., data = boston_scaled)

# target classes as numeric
classes <- as.numeric(km$cluster)

# plot the lda results. Note that lda.arrows is the same function we have used above
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influential linear separators for the clusters?

We observe in the above biplot that “tax” and “rad” have the most variation in the dataset. Moreover, the K-means seems to form accurate and separate clusters.

4.2.8 Super-Bonus: A cool 3D plot of LDA and K-means

We will recall the code for the (scaled) train data that we used to fit the LDA. We then create a matrix product, which is a projection of the data points.

set.seed(123)
# LDA

lda.fit <- lda(crime ~ ., data = train)

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

We now create a 3D plot of the columns of the matrix product:

library(plotly)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=~train$crime)

Now let’s run the k-means algorithm on the matrix product with 4 clusters (since the number of clusters of crime is 4), and draw another 3D plot where the color is defined by the clusters of the k-means.

set.seed(5)
km = kmeans(model_predictors, centers = 4)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=~factor(km$cluster))

The k-means clustering is mostly successful. One can see that there are 2 superclusters, while the clusters 1,2,4 (mostly) form their own subclusters under one of the superclusters. The cluster 3 is shared between the huge clusters. In the clusters for “crime”, “med_high” has this same property, while the other clusters are nicely separated into two superclusters. Thus, the k-means clustering plot with 4 clusters seems to give similar results compared to the lda.fit of the “crime” variable.